Assembly of metabarcode reads

Time
  • Teaching: 30 min
  • Exercises: 10 min
Questions
  • Why should metabarcode data be merged?
  • What is the difference between reads and contigs?
  • How can we merge metabarcode reads?
Objectives
  • Understand what an assembly is
  • Run a metabarcode “assembly” workflow

Create Contigs from Paired-end Reads

We will begin by merging our reads into contigs, followed by filtering and trimming of reads based on quality score and several other metrics. In this experiment, paired-end sequencing of either the V5-V6 region of the 16S rRNA gene or the ITS2 (Internal transcribed spacer) region between the 5.8S and 28S rDNA of the fungal ribosome genes, was performed. These regions are around 316 bp and 210 bp in length, respectively.

Our reads are between 200 and 250 bp in length, therefore this results in a significant overlap between the forward and reverse reads in each pair. We will combine these pairs of reads to produce contigs.

Note, we will just focus on the 16S rRNA metabarcoding reads but it is important to remember the difference between different regions that are amplified using metabarcoding.

Overview of Merging Reads into Contigs

This will be achieved by using the Make.contigs tool from the Mothur toolsuite. Make.contigs will look at each pair, take the reverse complement reverse read, and then determine the overlap between the two sequences. Where an overlapping base call differs between the two reads, the quality score is used to determine the consensus base call. A new quality score is derived by combining the two original quality scores in both of the reads for all the overlapping positions.

To run Make.contigs we can use the default settings, but ensure you select to output the log file.

Data Cleaning

Quality is always a consideration, therefore our next step is to improve the quality of our data. Before doing so, we want to get an overview of our data, similarly to using FastQC. To get a summary of the contigs generated in the previous step, we use Summary.seqs.

Input for this step is the trim.contigs.fasta file from the Make.contigs step. We also want to select Output logfile, as we will use this to get a summary of our data.

As before, the output from Make.contigs has given us long, difficult to decipher names, therefore it would be good to rename the files. This will make it easier to keep track of our data. It is good practise to do this periodically while working through the different data cleaning steps.

Summary of contigs for B_sample_98

The above summary result is from B_sample_98, which is a bacterial detection sample. Going through the results in the logfile, the key things we can gather are as follows:

  • Almost all the contigs are between 204 and 339 bases long (Nbases column, Minimum and 97.5%-tile rows)
  • 97.5% of our reads had no ambiguous base calls (Ambigs column)
  • The longest contig in the dataset is 452 bases (Nbases column, Maximum row)
  • There are 56,573 contigs in our dataset

Our region of interest for this sample is V5-V6 region of the 16S rRNA gene, which is only 316 bases long. Any reads significantly longer than this expected value likely did not assemble well in the Make.contigs step. Furthermore, we see that 2.5% of our reads have at most 21 ambiguous base calls (Ambigs column). In the next steps we will clean up our data by removing these problematic reads.

We do this data cleaning using the Screen.seqs tool, which removes:

  1. sequences with ambiguous bases (maxambig)
  2. contigs shorter than a given threshold (minlength)
  3. contigs longer than a given threshold (maxlength)

Therefore for B_Sample_98, for running Screen.seqs we want to use the following values:

  • maxambig = 0
  • minlength = 291
  • maxlength = 341

Our input is the Make.contigs on data X: trim.contigs.fasta, we can use default settings for all other variables and we want to output the logfile.

Exercise 1: How many reads were removed in this screening step?

Now that we have run the Screen.seqs tool, we want to know what impact this step had on our contigs. Think of what tool you could use to obtain this information

561

This can be obtained by comparing the total number of seqs between the summary log before and after this screening step, generated from running Summary.seqs on the Screen.seqs on data X: good.fasta. Alternatively, you could look at the number of lines in bad.accnos output of Screen.seqs.

Optimise files for computation

Microbiome samples typically contain large numbers of the same organism, and therefore we expect to find many identical sequences in our data. In order to speed up computation, we first determine the unique reads, and then record how many times each of these different reads was observed in the original dataset. We do this by using the Unique.seqs tool.

The input for the Unique.seqs is the “good” fasta produced from running Screen.seqs, we want the output file to be Name File and we want to toggle yes to output the logfile.

Exercise 2: What impact did Unique.seqs have?
  1. How many sequences were unique?
  2. How many dupilcates were removed?

This can be determined by examining the Unique.seqs on data X: logfile, as this displays the total number of sequences examined and the number of outputted unique sequences. For B_Sample_98 this is, 56,012 and 29,268.

For B_Sample_98 the answer is:

  1. 29,268
  2. 26,744

Here we see that this step has greatly reduced the size of our sequence file; not only will this speed up further computational steps, it will also greatly reduce the amount of disk space (and your Galaxy quota) needed to store all the intermediate files generated during this analysis.

This Unique.seqs tool created two files, one is a FASTA file containing only the unique sequences, and the second is a so-called names file. This names file consists of two columns, the first contains the sequence names for each of the unique sequences, and the second column contains all other sequence names that are identical to the representative sequence in the first column.

Unique.seqs Name File

To recap, we now have the following files:

  • a FASTA file containing every distinct sequence in our dataset (the representative sequences)
  • a names file containing the list of duplicate sequences
Comment: Representative sequences vs Total sequences

From now on, we will only work with the set of unique sequences, but it is important to remember that these represent a larger number of total sequences.

In the following we will use the unique sequences as input to tools instead of the complete set of sequences.

Keypoints
  • Assembly groups reads into contigs
  • Paired end reads need to be merged to get full length of region they cover
  • We need to consider the size of the region we amplified for QC of merged reads
  • Merging takes Fastq files as input and produce Fasta files as output